What, When and Where of petitions submitted to the UK Government during a time of chaos
In times marked by political turbulence and uncertainty, as well as
increasing divisiveness and hyperpartisanship, governments need to use every
tool at their disposal to understand and respond to the concerns of their
citizens. We study issues raised by the UK public to the Government during
2015-2017 (surrounding the UK EU-membership referendum), mining public opinion
from a dataset of 10,950 petitions (representing 30.5 million signatures). We
extract the main issues with a ground-up natural language processing (NLP)
method, latent Dirichlet allocation (LDA). We then investigate their temporal
dynamics and geographic features. We show that whilst the popularity of some
issues is stable across the two years, others are highly influenced by external
events, such as the referendum in June 2016. We also study the relationship
between petitions' issues and where their signatories are geographically
located. We show that some issues receive support from across the whole country
but others are far more local. We then identify six distinct clusters of
constituencies based on the issues which constituents sign. Finally, we
validate our approach by comparing the petitions' issues with the top issues
reported in Ipsos MORI survey data. These results show the huge power of
computationally analyzing petitions to understand not only what issues citizens
are concerned about, but also when and from where.
Comment: Preprint; under review
Directions in abusive language training data, a systematic review: Garbage in, garbage out
Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
Islamophobes are not all the same! A study of far right actors on Twitter
Far-right actors are often purveyors of Islamophobic hate speech online,
using social media to spread divisive and prejudiced messages. Hateful content
can inflict harm on targeted victims, create a sense of fear amongst
communities, and stir up intergroup tensions and conflict. Accordingly, there
is a pressing need to better
understand at a granular level how Islamophobia manifests online and who
produces it. We investigate the dynamics of Islamophobia amongst followers of a
prominent UK far right political party on Twitter, the British National Party.
Analysing a new data set of five million tweets, collected over a period of one
year, using a machine learning classifier and latent Markov modelling, we
identify seven types of Islamophobic far right actors, capturing qualitative,
quantitative and temporal differences in their behaviour. Notably, we show that
a small number of users are responsible for most of the Islamophobia that we
observe. We then discuss the policy implications of this typology in the
context of social media regulation.
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult, and there are often diverse, valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.
Comment: Accepted at NAACL 2022 (Main Conference)
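The contrast between the two paradigms shows up concretely in how annotator labels are aggregated: a prescriptive pipeline collapses annotations into a single adjudicated label, whereas a descriptive pipeline preserves the full distribution of beliefs. A minimal sketch follows; the toy labels and helper names are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter

# Toy annotations for one item from five annotators.
# Illustrative only; not data from the paper.
annotations = ["hate", "hate", "not_hate", "hate", "not_hate"]

def prescriptive_label(labels):
    """Prescriptive paradigm: enforce one belief, e.g. by majority
    vote backed by detailed annotation guidelines."""
    return Counter(labels).most_common(1)[0][0]

def descriptive_distribution(labels):
    """Descriptive paradigm: keep the distribution of beliefs,
    usable as a soft training target for a model."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

print(prescriptive_label(annotations))        # -> hate
print(descriptive_distribution(annotations))  # -> {'hate': 0.6, 'not_hate': 0.4}
```

The design choice the paper argues for is making this decision explicit: training on the majority label discards the disagreement signal, while training on the distribution models it.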
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
In this paper, we address the concept of "alignment" in large language models
(LLMs) through the lens of post-structuralist socio-political theory,
specifically examining its parallels to empty signifiers. To establish a shared
vocabulary around how abstract concepts of alignment are operationalised in
empirical datasets, we propose a framework that demarcates: 1) which dimensions
of model behaviour are considered important, then 2) how meanings and
definitions are ascribed to these dimensions, and by whom. We situate existing
empirical literature and provide guidance on deciding which paradigm to follow.
Through this framework, we aim to foster a culture of transparency and critical
evaluation, aiding the community in navigating the complexities of aligning
LLMs with human populations.
Comment: Socially Responsible Language Modelling Research (SoLaR) @ NeurIPS 2023
Understanding RT’s Audiences: Exposure Not Endorsement for Twitter Followers of Russian State-Sponsored Media
The Russian state-funded international broadcaster RT (formerly Russia Today) has attracted much attention as a purveyor of Russian propaganda. To date, most studies of RT have focused on its broadcast, website, and social media content, with little research on its audiences. Through a data-driven application of network science and other computational methods, we address this gap to provide insight into the demographics and interests of RT’s Twitter followers, as well as how they engage with RT. Building upon recent studies of Russian state-sponsored media, we report three main results. First, we find that most of RT’s Twitter followers only very rarely engage with its content and tend to be exposed to RT’s content alongside other mainstream news channels. This indicates that RT is not a central part of their online news media environment. Second, using probabilistic computational methods, we show that followers of RT are slightly more likely to be older and male than average Twitter users, and they are far more likely to be bots. Third, we identify thirty-five distinct audience segments, which vary in terms of their nationality, languages, and interests. This audience segmentation reveals the considerable heterogeneity of RT’s Twitter followers. Accordingly, we conclude that generalizations about RT’s audience based on analyses of RT’s media content, or on vocal minorities among its wider audiences, are unhelpful and limit our understanding of RT and its appeal to international audiences.